Video Game Sales and Performance Analysis - Group 6

1 Introduction

The video game industry is a multi-billion dollar industry that is still expanding in terms of both influence and popularity. People of all ages and backgrounds love playing video games, and they have a significant impact on our culture and society.Using information that has been scraped from Metacritic,

1.1 Objective

This research project’s goal is to continue to explore multivariate relationships in the data and leverage machine learning models to predict a game’s success based on its characteristics. At the end of our project, we hope to answer the question of “What makes a game successful?”

1.2 Data Source

The dataset used for this report was sourced through web scraping from Metacritic. It is important to note that certain observations are missing, as Metacritic covers only a subset of the gaming platforms. As a result, some games may not have complete information for all the variables. However, the dataset contains approximately 6,900 complete cases.

The dataset has information on the following attributes:

Game Specific:

Name: The name of the video game

Genre: The genre of the video game

Platform: The platform(s) on which the video game was released

Developer: The developer of the video game

Publisher: The publisher of the video game

Year of Release: The year in which the video game was released

Scores:

Metacritic User Score: The score given to the game by users.

Metacritic Critic Score: The score given to the game by critics.

Sales:

North America: The number of units sold in North America

Europe: The number of units sold in Europe

Japan: The number of units sold in Japan

Global: The total number of units sold worldwide

games<-read.csv('Video games sales.csv')
games$Platform<-as.factor(games$Platform)
games$Year_of_Release<-as.factor(games$Year_of_Release)
games$Genre<-as.factor(games$Genre)
games$Publisher<-as.factor(games$Publisher)
games$Developer<-as.factor(games$Developer)
games$Rating<-as.factor(games$Rating)

2 Exploratory Data Analysis

2.1 correlation Plot between Variables

# Plot the correlation matrix
corrplot(cor_matrix, method = "circle")

2.2 Single Platform vs Multi Platform comparison

As part of our research questions, we wanted to analyze how games performed when they were targeted towards a specific platform vs when they are released across multiple platforms.

library(dplyr)
library(ezids)
library(ggplot2)

grouped_by_platform <- split(games_final, games_final$Platform)
platform_summary <- lapply(grouped_by_platform, function(x) summary(x$Global_Sales))
boxplot(Global_Sales ~ Platform, data = games_final, main="Global Sales by Platform", xlab="Platform", ylab="Global Sales")

games_final <- games_final %>%
  group_by(Name) %>%
  mutate(Platform_Count = n())

single_platform_games <- filter(games_final, Platform_Count == 1)
multiple_platform_games <- filter(games_final, Platform_Count > 1)


average_sales_single_platform <- mean(single_platform_games$Global_Sales)
average_sales_multiple_platforms <- mean(multiple_platform_games$Global_Sales)

average_rating_single_platform <- mean(single_platform_games$Critic_Score, na.rm = TRUE)
average_rating_multiple_platforms <- mean(multiple_platform_games$Critic_Score, na.rm = TRUE)

#cat("Average Sales - Single Platform Games:", average_sales_single_platform, "\n")
#cat("Average Sales - Multiple Platform Games:", average_sales_multiple_platforms, "\n\n")

#cat("Average Rating - Single Platform Games:", average_rating_single_platform, "\n")
#cat("Average Rating - Multiple Platform Games:", average_rating_multiple_platforms, "\n")


t_test_result <- t.test(single_platform_games$Global_Sales, multiple_platform_games$Global_Sales)

# Extract p-value
p_value_t_test <- t_test_result$p.value

# Display p-value
#cat("P-value (t-test):", p_value_t_test, "\n")


combined_data <- rbind(data.frame(Sales = single_platform_games$Global_Sales, Platform = "Single Platform"),
                       data.frame(Sales = multiple_platform_games$Global_Sales, Platform = "Multiple Platforms"))

# Global Sales

ggplot(combined_data, aes(x = Platform, y = Sales, fill = Platform)) +
  geom_violin(trim = FALSE, scale = "width", width = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", color = "black", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Comparison of Global Sales for single platform vs multi platform games",
       x = "Platform Count",
       y = "Global Sales (Millions of Copies)") +
  theme(legend.position = "none")

As the plots above suggest, the global sales for games releasing on multiple platforms is higher than those targetted towards single platforms. This difference is statistically significant.

# Critic Score
combined_data <- rbind(data.frame(Score = single_platform_games$Critic_Score, Platform = "Single Platform"),
                       data.frame(Score = multiple_platform_games$Critic_Score, Platform = "Multiple Platforms"))


ggplot(combined_data, aes(x = Platform, y = Score, fill = Platform)) +
  geom_violin(trim = FALSE, scale = "width", width = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", color = "black", alpha = 0.8) +
  theme_minimal() +
  labs(title = "Comparison of Critic Scores",
       x = "Platform Count",
       y = "Critic Score") +
  theme(legend.position = "none")

# For single platform games

platform_occurrence <- single_platform_games %>%
  group_by(Platform) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

ggplot(platform_occurrence, aes(x = reorder(Platform, -Count), y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Occurrence of Platforms for Single-Platform Games",
       x = "Platform",
       y = "Count") +
  theme_minimal()

# For multi platform games

platform_occurrence_multi <- multiple_platform_games %>%
  group_by(Platform) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

ggplot(platform_occurrence_multi, aes(x = reorder(Platform, -Count), y = Count)) +
  geom_bar(stat = "identity", fill = "lightcoral", color = "black") +
  labs(title = "Occurrence of Platforms for Multi-Platform Games",
       x = "Platform",
       y = "Count") +
  theme_minimal()

platform_difference <- merge(platform_occurrence, platform_occurrence_multi, by = "Platform", suffixes = c("_Single", "_Multi"))
platform_difference$Difference <- platform_difference$Count_Multi - platform_difference$Count_Single

ggplot(platform_difference, aes(x = reorder(Platform, -Difference), y = Difference)) +
  geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
  labs(title = "Difference in Platform Occurrence between Multi and Single-Platform Games",
       x = "Platform",
       y = "Difference in Games Sold") +
  theme_minimal()

2.3 Genre and Developer vs Global Sales

We aimed to explore the significant factors influencing game sales, particularly focusing on the impact of the game’s genre and developer. To analyze this, we generated a scatter plot visualizing the global sales of the top 20 developers, with each point color-coded based on the corresponding game’s genre.

data=games_final

top_developers <- data %>%
  group_by(Developer) %>%
  summarise(Total_Sales = sum(Global_Sales, na.rm = TRUE)) %>%
  top_n(20, Total_Sales) %>%
  pull(Developer)

filtered_data <- data %>%
  filter(Developer %in% top_developers)

ggplot(filtered_data, aes(x = Developer, y = Global_Sales, color = Genre)) +
  geom_point(position = position_jitter(width = 0.3, height = 0), alpha = 0.7) +
  labs(title = "Global Sales for Top 20 Developers",
       x = "Developer",
       y = "Global Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

This reveals an interesting pattern. Some developers specialize in games from a specific genre such as Electronic Arts (EA) in sports and Codemasters in racing. These Genres tend to be the best selling games for these developers. On the other hand, there are some developers like Namco, Nintendo and Capcom that have spread their wings across all kinds of games with great success. Altogether, the genres of sports, action and racing seem to stand out amongst others as the best performing genres. The plot below confirms the most popular Genres by global sales

data$Genre <- factor(data$Genre, levels = names(sort(tapply(data$Global_Sales, data$Genre, sum), decreasing = FALSE)))

ggplot(data, aes(x = Genre, y = Global_Sales)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  coord_flip() +
  labs(title = "Global Sales by Genre",
       x = "Genre",
       y = "Global Sales") + theme_minimal()

  • Action games are the most popular, with total sales far exceeding the other genres with sales more than 1000.
  • Followed by Action genre sports are the second most popular genre , with sales around 750.
  • The least sold games we puzzles and strategy as we know most of these types of games are played by less amount of people

2.4 Yearly Sales by Region and Genre

Continuing our investigation, we looked to dive into the trends of annual video game sales across different regions. Our objective is to unravel the dynamic patterns and shifts in the popularity of video game genres over time. By visualizing the data on a yearly basis, we aim to discern evolving consumer preferences, identify emerging trends, and gain insights into the changing landscape of the gaming industry. This exploration will provide valuable context for understanding the trajectory of video game sales in distinct regions and shed light on the trends that have shaped the market’s evolution.

# Create a bar plot for NA region only with each genre a different color
na_genre_sales_plot <- ggplot(na_best_selling_genres, aes(x = Year_of_Release, y = NA_Sales, fill = Genre)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  
  labs(title = "NA Annual Sales with Best-Selling Genre Highlighted", x = "Year of Release", y = "Sales") +
  theme(legend.title = element_blank()) 

print(na_genre_sales_plot)

The graph displays the annual sales of the best-selling video game genres in North America, with each bar representing a specific year.

  • A period dominated by action games might be followed by a surge in the popularity of sports games.

2.5 Critic_Score Vs Sales

library(ggplot2)

ggplot(games_final, aes(x = Critic_Score, y = NA_Sales)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot: Critic_Score vs. NA Sales",
       x = "Critic_Score",
       y = "NA Sales")

critic Score Vs NA Sales The spread is slightly more even, with a few games receiving high critic scores also achieving higher sales, but the overall trend seems to suggest that critic scores do not guarantee high sales.

ggplot(games_final, aes(x = Critic_Score, y = EU_Sales)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot: Critic_Score vs. EU Sales",
       x = "Critic_Score",
       y = "EU Sales")

critic Score Vs EU Sales The European sales scatter plot indicates many games with lower sales figures across the spectrum of critic scores.There is a concentration of data points at the lower end of sales regardless of the critic score, suggesting many games have low sales figures despite varied reviews. A few data points with higher sales can be seen, implying that some well-reviewed games also achieve high sales

ggplot(games_final, aes(x = Critic_Score, y = JP_Sales)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot: Critic_Score vs. JP Sales",
       x = "Critic_Score",
       y = "JP Sales")

critic score Vs Jp Sales The Japanese market scatter plot shows a more pronounced cluster of games with low sales, and there is a slightly higher concentration of games with higher critic scores that do not have corresponding high sales. This suggests a less strong correlation between critic scores and sales in Japan compared to North America and Europe.

ggplot(games_final, aes(x = Critic_Score, y =Global_Sales)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Scatter Plot: critic_score vs. Global_ Sales",
       x = "Critic_Score",
       y = "Global_Sales")

critic Score Vs Global Sales The global sales scatter plot aggregates the data from all regions, showing a broad distribution of sales and critic scores. While there’s a visible trend that higher critic scores can correlate with higher sales

3 Model Building

With the comprehensive exploration of key elements influencing the success of video games, we are now ready to start the pivotal phase of model building. We aim to harness the power of machine learning models studies during our course to capture these trends and forecast the success of video games, focusing specifically on their global sales.rich insights gleaned from our exploratory analysis.

3.1 The Regression problem

Our initial approach involves developing predictive models that quantify global sales as a numeric value.

3.1.1 Linear Model

model<-lm(Global_Sales~Critic_Score+User_Score+Genre+Year_of_Release,data=trainData)

summary(model)
## 
## Call:
## lm(formula = Global_Sales ~ Critic_Score + User_Score + Genre + 
##     Year_of_Release, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64127 -0.25745 -0.09946  0.15256  1.49087 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.7500247  2.7185929   0.644  0.51978    
## Critic_Score       0.0094591  0.0004970  19.034  < 2e-16 ***
## User_Score        -0.0096350  0.0047565  -2.026  0.04285 *  
## GenreAdventure    -0.1413698  0.0287138  -4.923 8.78e-07 ***
## GenreFighting      0.0330152  0.0252321   1.308  0.19078    
## GenreMisc          0.0260103  0.0254548   1.022  0.30691    
## GenrePlatform      0.0026508  0.0249682   0.106  0.91545    
## GenrePuzzle       -0.1322916  0.0417685  -3.167  0.00155 ** 
## GenreRacing       -0.0142833  0.0213105  -0.670  0.50273    
## GenreRole-Playing -0.0807471  0.0197866  -4.081 4.56e-05 ***
## GenreShooter      -0.0135506  0.0186155  -0.728  0.46670    
## GenreSimulation   -0.0345418  0.0280356  -1.232  0.21798    
## GenreSports       -0.0092711  0.0184109  -0.504  0.61459    
## GenreStrategy     -0.2303472  0.0278174  -8.281  < 2e-16 ***
## Year_of_Release   -0.0009515  0.0013517  -0.704  0.48149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3717 on 4904 degrees of freedom
## Multiple R-squared:  0.1131, Adjusted R-squared:  0.1106 
## F-statistic: 44.68 on 14 and 4904 DF,  p-value: < 2.2e-16

The model aims to predict a dependent variable, likely related to video game sales, based on various independent variables like User_score,Critic_score,Genre and year of release

Coefficient The coefficients indicate the impact of each independent variable on the dependent variable. The independent variables in the model such as Critic_Score,User_Score, and GenreStrategy have significant effects on video game sales

R square The R-squared value (0.1131) indicates that the model explains approximately 11.31% of the variance in the dependent variable.

Adj R Square The adjusted R-squared (0.1106) accounts for the number of predictors in the model.

An R-squared value of -3.385177 is unusual.Negative R-squared values usually suggest that the model is performing worse than the Null model.Based on this R-squared value, the model is not considered good for explaining the variability in Global_Sales.

residuals <- model$residuals
hist(residuals, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals", col = "blue")
abline(v = mean(residuals), col = "red", lwd = 2)

In the graph we could see that the residuals are normaly distributed stating the assumptions of linear Regression model

3.2 The Classification problem

As we discovered, predicting the global sales as a numeric value is a challenging task due to the volatility and unpredictability of sales. While there are discernbale patterns on what makes a good game, the exact fluctuations in sales as a result of it are difficult to capture. Hence we move to a secondary approach.

Classifying Games as Hit or Miss

The second approach involves adding a columns called ‘Hit’ which classifies the game as a Hit or a Miss based on whether it lies in the top 25 percentile of games by global sales or not. This was decided based on the popularity of quantiles especially in boxplots in determining the upper echelon of video games.

3.2.1 Logistic Model

# Fit a logistic regression model
logit_model <- glm(Hit ~ Critic_Score + Developer  + Genre , 
                   data = train_data, family = binomial())
print(summary(logit_model))
## 
## Call:
## glm(formula = Hit ~ Critic_Score + Developer + Genre, family = binomial(), 
##     data = train_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.639119   0.381704 -12.154  < 2e-16 ***
## Critic_Score       0.053450   0.003159  16.921  < 2e-16 ***
## Developer1553      0.309054   0.422558   0.731  0.46454    
## Developer1561      0.390173   0.435695   0.896  0.37051    
## Developer266      -0.145614   0.408228  -0.357  0.72132    
## Developer307      -0.492524   0.489739  -1.006  0.31457    
## Developer434      -0.130454   0.405546  -0.322  0.74770    
## Developer452       0.556586   0.407152   1.367  0.17162    
## Developer456       0.865595   0.435744   1.986  0.04698 *  
## Developer480       0.516676   0.449320   1.150  0.25018    
## Developer827      -0.289683   0.436706  -0.663  0.50712    
## DeveloperOthers   -0.131287   0.329242  -0.399  0.69007    
## GenreAdventure    -0.960955   0.242681  -3.960 7.50e-05 ***
## GenreFighting     -0.061046   0.160102  -0.381  0.70299    
## GenreMisc          0.003359   0.161289   0.021  0.98338    
## GenrePlatform      0.006520   0.158578   0.041  0.96720    
## GenrePuzzle       -0.824972   0.318277  -2.592  0.00954 ** 
## GenreRacing       -0.135961   0.143275  -0.949  0.34265    
## GenreRole-Playing -0.344525   0.129743  -2.655  0.00792 ** 
## GenreShooter       0.023498   0.118709   0.198  0.84309    
## GenreSimulation   -0.325123   0.183257  -1.774  0.07604 .  
## GenreSports       -0.210337   0.127580  -1.649  0.09922 .  
## GenreStrategy     -1.142691   0.222714  -5.131 2.89e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5558.4  on 4917  degrees of freedom
## Residual deviance: 5104.6  on 4895  degrees of freedom
## AIC: 5150.6
## 
## Number of Fisher Scoring iterations: 4
# Plot the ROC curve
plot(roc_obj, main = "ROC Curve for Logistic Regression Model")

# Print AUC (Area Under the Curve)
#auc(roc_obj)

Area under the curve: 0.7107

# Assuming you have 'predictions' and 'test_data$Hit' from your logistic regression model
predicted_class <- ifelse(predictions > 0.5, "1", "0")
predicted_class <- factor(predicted_class, levels = c("0", "1"))

# Actual values
actual_class <- test_data$Hit

# Create a table for Actual vs Predicted
confusion_matrix <- table(Predicted = predicted_class, Actual = actual_class)

library(ggplot2)

# Assuming you have 'predicted_class' and 'actual_class' from your model
# Convert the confusion matrix to a data frame for plotting
confusion_matrix_df <- as.data.frame(confusion_matrix)

# Rename columns
names(confusion_matrix_df) <- c("Predicted", "Actual", "Freq")

# Plot the confusion matrix using ggplot2
ggplot(data = confusion_matrix_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  geom_text(aes(label = sprintf("%0.0f", Freq)), vjust = 1) +
  theme_minimal() +
  labs(fill = "Frequency", x = "Actual Class", y = "Predicted Class", title = "Confusion Matrix")

We are using logistic regression model to predict if a games is hit or not based on the threshold value. we are using Critic Score , Developer and Genre as predicotrs. For making logistic model at first we are using binning function. Binning top developers because there are likely many unique developers in the dataset. Without binning, the model would have too many variables (one for each developer), many of which would have little data, leading to overfitting and making the model less generalizable. By binning, you reduce the complexity of the model and focus on the most common developers.

Coefficients and p-values: These values represent the change in the log odds of the dependent variable (in your case, whether a game is a ‘hit’ or not) for a one-unit change in the predictor.

Critic_Score (Coefficient: 0.053450, p-value: < 2e-16) * This is a highly significant predictor. The positive coefficient indicates that higher critic scores are associated with a higher likelihood of a game being a hit. The extremely low p-value suggests this relationship is statistically significant.

Developer456 (Coefficient: 0.865595, p-value: 0.04698)

  • This specific developer (represented as ‘Developer456’) has a positive impact on the likelihood of a game being a hit. The p-value is just under 0.05, indicating that this result is statistically significant, although it’s on the borderline.

GenreAdventure (Coefficient: -0.960955, p-value: 7.50e-05)

  • Being an adventure game is associated with a lower likelihood of being a hit, as indicated by the negative coefficient. The p-value is very small, showing strong evidence against the null hypothesis.

GenrePuzzle (Coefficient: -0.824972, p-value: 0.00954)

Puzzle games are also less likely to be hits. The negative coefficient and its low p-value signify a strong and statistically significant relationship.

GenreRole-Playing (Coefficient: -0.344525, p-value: 0.00792)

  • Role-Playing games have a negative coefficient, suggesting they are less likely to be hits compared to the baseline genre. The p-value indicates this finding is statistically significant.

GenreStrategy (Coefficient: -1.142691, p-value: 2.89e-07)

  • Strategy games show a strong negative association with being a hit. The coefficient is significantly negative, and the very low p-value strongly supports its significance.

Null Devience:

  • Null deviance shows the goodness of fit of a model with no predictors (only the intercept) and residual deviance shows the goodness of fit of your model. A lower residual deviance compared to the null deviance indicates that the model is a better fit than the null model.

Accuracy(0.7606): * The above model has an accuracy of 76.06%, which means it correctly predicts whether a game is a hit or not 76.06% of the time.

Confusion Matrix:

  • True Negatives (TN): 897 - The model correctly predicted the non-hit games.
  • False Positives (FP): 273 - The model incorrectly predicted these games as hits.
  • False Negatives (FN): 21 - The model incorrectly predicted these hits as non-hits.
  • True Positives (TP): 37 - The model correctly predicted the hit games.

Summary: Based on the provided information, the logistic regression model exhibits an overall accuracy of 76.06%, which is higher than the no information rate, suggesting it has learned to predict better than a naive model. However, the model has a high sensitivity (97.71%) and low specificity (11.94%), indicating it is adept at predicting games that will not be hits (flops) but performs poorly in identifying successful hit games. The Area Under the Curve (AUC) of 0.7107 from the ROC analysis reveals a good discriminatory ability, but the model’s prediction bias towards non-hits is also evident, as shown by the high number of false negatives. This suggests that while the model is quite conservative and effective in identifying non-hit games, it needs improvement to balance its predictive performance for both hits and non-hits

3.2.2 Decision Trees

rpart.plot(tree_model, extra = 101, under = TRUE, cex = 0.8)

4 Conclusion

Our research has uncovered important insights into the gaming industry, revealing significant patterns and differences among various factors. However, it’s crucial to recognize some limitations. First, our data collection method using web scraping might have introduced errors and missing information. Second, our analysis is limited by the exclusion of certain platforms from Metacritic data, affecting the overall completeness of our findings. Despite these constraints, our research provides valuable knowledge about the industry’s dynamics, but it’s essential to interpret the results with awareness of these limitations.

5 References

-Chen, H., & Hu, Y. (2020). The impact of marketing campaigns on video game sales: A meta-analysis. Journal of Marketing, 84(4), 124-143. https://www.sciencedirect.com/science/article/pii/S0148296320304161

-Lin, Y., & Wu, C. (2021). The impact of player demographics on video game sales: A segmentation analysis. Computers in Human Behavior, 118, 106653. https://www.researchgate.net/publication/31977403_Segmentation_of_the_games_market_using_multivariate_analysis

-Duan, W., & Zhang, J. (2019). The impact of critic reviews on video game sales: Evidence from a large-scale panel dataset. Journal of Business Research, 104, 70-80. https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2023.4754

-[OIS] Diez et al. (2015), OpenIntro Statistics. https://leanpub.com/os